As modern architectures introduce additional heterogeneity and parallelism, we look for ways to
deal with this that do not involve specialising software to every platform. In this paper, we take 
the Join Calculus, an elegant model for concurrent computation, and show how it can be mapped 
to an architecture by a Cartesian-product-style construction, thereby making use of the calculus' 
inherent non-determinism to encode placement choices. This unifies the concepts of placement and 
scheduling into a single task. 

1 Introduction 

The Join Calculus was introduced as a model of concurrent and distributed computation [4]. Its elegant
primitives have since formed the basis of many concurrency extensions to existing languages, both
functional [3, 12] and imperative [1, 7], and also of libraries [?]. More recently, there has also been
work showing that a careful implementation can match, and even exceed, the performance of more
conventional primitives [?].

However, most of the later work has considered the model in the context of shared-memory multi-core
systems. We argue that the original Join Calculus assumption of distributed computation with
disjoint local memories lends itself better to an automatic approach. This paper adapts our previous
work on Petri-nets [2] to show how the non-determinism of the Join Calculus can express placement
choices when mapping programs to heterogeneous systems; both data movement between cores and local
computation are seen as scheduling choices. Thus we argue that a JVM-style bytecode with Join Calculus
primitives can form a universal intermediate representation that not only models existing concurrency
primitives, but also adapts to different architectures at load-time.

This paper introduces our construction by considering the following Join Calculus program that sorts
an array of integers using a merge-sort-like algorithm. There is clearly scope for parallelising both the
split and merge steps, although this may require moving data to another memory.

    def sort(numbers, k) =
      let N = numbers.length in
      def split(a) =
            let n = a.length
            in if n == 1 then merge(a)
               else split(a[0..(n/2)-1]) & split(a[(n/2)..(n-1)])
          merge(a) & merge(b) =
            if a.length + b.length == N then do_merge(a, b, k)
            else do_merge(a, b, merge)
      in split(numbers)

where do_merge is a functional-style procedure that merges the sorted arrays a and b into a new sorted
array that is passed to its continuation (k or merge). We assume moderate familiarity with Join Calculus
primitives. In particular, we will:
• Restrict the Join Calculus to make all data usage explicit, showing that existing programs can be
desugared into this form (Section 2).

• Briefly show how our existing work manifests itself in the Join Calculus (Section 3).

• Introduce workers to the Join Calculus semantics as a substitute for the resource constraints in the
Petri-net version (Section 4).

We offer a discussion of the scheduling issues, and of how we believe these to be tractable, in Section 5
before concluding in Section 6.
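To make the running example concrete, the following is a minimal sequential sketch in Python that simulates the split and merge rules with explicit message queues. It is only an illustration under stated assumptions: the name `join_sort`, the queue-based firing order and the Python body of `do_merge` are not from the paper, and the continuation k is modelled simply as the function's return value.

```python
from collections import deque

def do_merge(a, b):
    """Merge two sorted lists into a new sorted list."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    return out + a[i:] + b[j:]

def join_sort(numbers):
    """Simulate the split/merge join definition with explicit message queues."""
    if not numbers:
        return []
    total = len(numbers)                   # plays the role of N
    split_msgs = deque([list(numbers)])    # pending split(a) messages
    merge_msgs = deque()                   # pending merge(a) messages
    while True:
        if split_msgs:                     # rule: split(a)
            a = split_msgs.popleft()
            n = len(a)
            if n == 1:
                merge_msgs.append(a)
            else:
                split_msgs.append(a[: n // 2])
                split_msgs.append(a[n // 2:])
        elif len(merge_msgs) >= 2:         # rule: merge(a) & merge(b)
            a, b = merge_msgs.popleft(), merge_msgs.popleft()
            merged = do_merge(a, b)
            if len(merged) == total:
                return merged              # result handed to the continuation k
            merge_msgs.append(merged)      # continuation is merge
        else:
            return merge_msgs.popleft()    # single-element input: nothing left to join
```

Any pairing of merge messages yields a sorted result, which is why the choice of which rule to fire next can be left to a scheduler.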



2 The Non-Nested Join Calculus 

As in our previous work, our construction introduces explicit data transfer transitions. For these to
cover all required data transfers, we disallow references to free variables which may be located on other
processors, i.e. values that are not received as part of the transition's left-hand-side join pattern.
Unfortunately, nested Join Calculus definitions capture values in this way. In our running example, observe
that both N and k are used implicitly by merge.

Our formulation of the Join Calculus therefore forbids the nesting of definitions. Instead, programs
consist of a list of definitions. This necessitates a special type of signal, the constructor, which is used to
create and initialise a join definition. A new version of our program is shown in box "A" of Figure 1.

However, despite this restriction, nested definitions can easily be encoded by a process similar to
both lambda-lifting and Java's inner classes. In particular, any program similar to:

    a(x,k) {
        definition { .ctor Nested() { k(f); }
                     f(m) { m(x * 2); } }
        construct Nested();
    }

can be rewritten in a style similar to:

    definition { .ctor UnNested(x,k) { temp(x); k(f); }
                 temp(x) { temp(x); temp(x); }
                 f(m) & temp(x) { temp(x); m(x * 2); } }
    a(x,k) { construct UnNested(x,k); }

Unfortunately, the extra temp signal would cause serialisation of many transitions within the definition,
since each transition that reads x must first consume the single temp message (re-emitting it afterwards).
This is resolved by the duplication transition, which allows us to create as many copies of the x message
as we require. We rely on the scheduler not to perform excessive duplications. We might also be able to
optimise this 'peek' behaviour in our compiler.
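The un-nesting step is close in spirit to lambda-lifting, and a rough Python analogy may help. In the sketch below, `a_nested` relies on a closure capturing x, while `UnNested` stores the captured value explicitly when constructed. The class layout and the helper `collect` are illustrative assumptions; the sketch does not model message passing or the duplication transition.

```python
def a_nested(x, k):
    # Nested form: f captures x from the enclosing scope.
    def f(m):
        m(x * 2)
    k(f)

class UnNested:
    """Lifted form: the constructor stores the captured value explicitly."""
    def __init__(self, x, k):
        self.temp = x            # plays the role of the temp(x) message
        k(self.f)

    def f(self, m):
        m(self.temp * 2)         # reads temp without consuming it ('peek')

def a_lifted(x, k):
    UnNested(x, k)

def collect(a):
    """Call a with x = 21 and record what f sends to its continuation m."""
    results = []
    a(21, lambda f: f(results.append))
    return results
```

Both forms send the same value to m; only the place where x lives has changed.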

As we will later build on our previous work involving Petri-nets [2], it is worth highlighting Odersky's
discussion [5] on the correspondence between the Join Calculus and (coloured) Petri-nets. Just as a
Petri-net transition has a fixed multi-set of pre-places, each transition in the Join Calculus has a fixed join
pattern defining its input signals. The key difference is that the Join Calculus is higher-order, allowing
signals to be passed as values, and for the output signals to depend on its inputs, unlike Petri-nets where
the post-places of a transition are fixed. This simple modification allows use of continuations to support
functions. Moreover, while nets are static at runtime, a Join Calculus program can create new instances
of definitions (containing signals and transition rules) at runtime, and although these cannot match on
existing signals, existing transitions can send messages to the new signals.
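The correspondence with pre-places can be sketched as multiset matching. The Python fragment below treats a join pattern as a multiset of signal names and a message pool as a list of (signal, payload) pairs; the function names `enabled` and `fire` and this data layout are assumptions made for illustration, not part of any Join Calculus implementation.

```python
from collections import Counter

def enabled(pattern, pool):
    """A pattern is enabled when the pool holds every signal it needs,
    counting multiplicity (like the pre-places of a Petri-net transition)."""
    need = Counter(pattern)
    have = Counter(signal for signal, _ in pool)
    return all(have[s] >= n for s, n in need.items())

def fire(pattern, pool):
    """Consume one message per pattern entry; return (payloads, remaining pool)."""
    if not enabled(pattern, pool):
        raise ValueError("pattern not enabled")
    remaining = list(pool)
    payloads = []
    for signal in pattern:
        idx = next(i for i, (s, _) in enumerate(remaining) if s == signal)
        payloads.append(remaining.pop(idx)[1])
    return payloads, remaining
```

Unlike a Petri-net, what the fired body then emits may depend on the payloads, for example on a continuation carried in one of them.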
    definition {
        // box "A"
        .ctor sort_x(numbers, k) {
            split_x(numbers);
            info_x(numbers.length, k);
        }
        info_x(N, k) { info_x(N, k); info_x(N, k); }
        split_x(a) {
            let n = length(a);
            if (n == 1) { merge_x(a); }
            else { split_x(a[0..(n/2)-1]);
                   split_x(a[(n/2)..(n-1)]); }
        }
        merge_x(a) & merge_x(b) & info_x(N, k) {
            info_x(N, k);
            if (a.length + b.length == N)
                do_merge(a, b, k);
            else do_merge(a, b, merge_x);
        }

        // box "B"
        .ctor sort_y(numbers, k) {
            split_y(numbers);
            info_y(numbers.length, k);
        }
        info_y(N, k) { info_y(N, k); info_y(N, k); }
        split_y(a) {
            let n = length(a);
            if (n == 1) { merge_y(a); }
            else { split_y(a[0..(n/2)-1]);
                   split_y(a[(n/2)..(n-1)]); }
        }
        merge_y(a) & merge_y(b) & info_y(N, k) {
            info_y(N, k);
            if (a.length + b.length == N)
                do_merge(a, b, k);
            else do_merge(a, b, merge_y);
        }

        // box "C"
        split_x(a) { split_y(a); }
        split_y(a) { split_x(a); }
        merge_x(a) { merge_y(a); }
        merge_y(a) { merge_x(a); }
        info_x(N, k) { info_y(N, "k on y"); }
        info_y(N, k) { info_x(N, "k on x"); }
    }




Figure 1: Mapped version of merge-sort for a dual-processor system 
3 Mapping Programs to Heterogeneous Hardware

We will use the same simple hardware model as in our previous work. This considers each processor to
be closely tied to a local memory. It then defines interconnects between these. The construction itself will
be concerned with a finite set of processors $P$, a set of directed interconnects between these $I \subseteq P \times P$,
and a computability relation $C \subseteq P \times R$ (where $R$ is the set of transition rules in the program), such that
$(p, r) \in C$ implies that the rule $r$ can be executed on the processor $p$. In our example, we take $P = \{x, y\}$,
$I = \{(x, y), (y, x)\}$ and a computability relation equal to $P \times R$. However, it is easy to imagine more
complex scenarios: for instance, if one processor lacked floating point support, $C$ would not relate it to
any transitions using floating point operations.
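As a concrete reading of this model, the sketch below writes P, I, R and C as Python sets for the dual-processor example. The rule names and the helpers `can_execute` and `without_support` are assumptions for illustration, as is the variation in which one rule is taken to need floating point.

```python
from itertools import product

# Hardware model for the dual-processor example.
P = {"x", "y"}                       # processors
I = {("x", "y"), ("y", "x")}         # directed interconnects, a subset of P x P
R = {"split", "merge", "info_dup"}   # transition rules (illustrative names)
C = set(product(P, R))               # computability relation, here all of P x R

def can_execute(p, r, relation):
    """True when rule r may run on processor p."""
    return (p, r) in relation

def without_support(relation, processor, rules):
    """Drop the pairs for rules that one processor cannot run."""
    return {(p, r) for (p, r) in relation
            if not (p == processor and r in rules)}

# Hypothetical variation: suppose "merge" used floating point and y lacked it.
C_no_float_on_y = without_support(C, "y", {"merge"})
```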

A scheduler will also need a cost model, although this is not needed for this work. We would expect
an affine (i.e. latency plus bandwidth) cost for the interconnect. In practice, this and the approximate
cost of each transition on each processor would be given by profiling information.
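An affine interconnect cost can be written down directly. The sketch below is one plausible form, with the function name, parameter names and the numbers in the usage note all being made-up assumptions rather than measured values.

```python
def transfer_cost(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Affine cost: a fixed latency plus a term proportional to the data size."""
    return latency_s + size_bytes / bandwidth_bytes_per_s

def total_cost(sizes, latency_s, bandwidth_bytes_per_s):
    """Cost of sending each block as a separate transfer."""
    return sum(transfer_cost(s, latency_s, bandwidth_bytes_per_s) for s in sizes)
```

With a latency of 0.5 s and a bandwidth of 2000 bytes/s, one 1000-byte transfer costs 1.0 s, whereas two separate 500-byte transfers cost 1.5 s in total, since the latency is paid twice.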

There are two parts to our construction. Firstly, we produce a copy of the program for each $p \in P$,
omitting any transition rules $r$ for which $(p, r) \notin C$, giving box "B" of Figure 1.

Secondly, we add transitions that correspond to possible data transfers (box "C"). This requires one
rule per signal and interconnect pair. However, the higher-order nature of the Join Calculus means
these need more careful definition than in our Petri-net work to preserve locality. Specifically, when a
signal value such as k is transferred it needs to be modified so that it becomes local to the destination
processor. This maintains the invariant that the 'computation transitions' introduced by the first part of
the construction can only call signals on the same processor.
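The two parts of the construction can be sketched as a small Python function over rule and signal names. This is a simplified illustration under assumptions: rules are reduced to their join patterns, the `move_...` naming, the `Signal` tuple and the `relocate` helper are hypothetical, and rule bodies are not rewritten here.

```python
from collections import namedtuple
from itertools import product

Signal = namedtuple("Signal", "name processor")   # a signal value tied to one processor

def map_program(rules, signals, P, I, C):
    """rules maps a rule name to its join pattern (a tuple of signal names)."""
    mapped = []
    # Part 1: a copy of every computable rule per processor, over suffixed signals.
    for p in sorted(P):
        for name, pattern in sorted(rules.items()):
            if (p, name) in C:
                mapped.append((f"{name}_{p}", p, tuple(f"{s}_{p}" for s in pattern)))
    # Part 2: one transfer rule per signal and interconnect pair.
    for src, dst in sorted(I):
        for s in sorted(signals):
            mapped.append((f"move_{s}_{src}_{dst}", (src, dst), (f"{s}_{src}",)))
    return mapped

def relocate(value, dst):
    """Re-point a transferred signal value at the destination; leave other data alone."""
    return Signal(value.name, dst) if isinstance(value, Signal) else value

rules = {"split": ("split",), "merge": ("merge", "merge", "info"), "info_dup": ("info",)}
P, I = {"x", "y"}, {("x", "y"), ("y", "x")}
mapped = map_program(rules, {"split", "merge", "info"}, P, I, set(product(P, rules)))
```

For the merge-sort example this yields twelve rules: three per processor and six transfers, one for each of the three signals over each of the two interconnects.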

4 Workers in place of Resource Constraints 

In the Petri-net version of this work, there was a third part to the construction. We introduced resource
constraint places to ensure that each processor or interconnect only performed one transition at once.
Equivalent signals would be illegal in the Join Calculus, as they would need to be matched on by
transitions from multiple definition instances (since processor time is shared between these). Changing the
calculus to allow this would make it harder to generate an efficient implementation. Instead, we introduce
the notion of workers to the semantics.

Rather than allowing any number of transition firings to be mid-execution at a given time, we restrict
each worker to performing zero or one firing at a time. We also tag each transition with the worker that
may fire it. In our example, we would have four workers: x, y, (x, y) and (y, x). The _x and _y copies
of the original program are tagged with the x and y CPU workers respectively, while the data transfer
transitions are tagged with the relevant interconnect worker.
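The worker restriction amounts to choosing at most one enabled transition per idle worker. The sketch below shows that selection in Python; the function name `startable`, the list-of-pairs representation and the first-found choice are assumptions for illustration only.

```python
def startable(enabled_transitions, busy_workers):
    """Pick at most one enabled transition for every idle worker.

    enabled_transitions is a list of (transition name, worker tag) pairs;
    the first enabled transition found for each idle worker is chosen.
    """
    chosen = {}
    for name, worker in enabled_transitions:
        if worker not in busy_workers and worker not in chosen:
            chosen[worker] = name
    return chosen
```

Interconnect workers such as ("x", "y") are treated exactly like CPU workers, so a transfer can proceed while both processors are busy computing.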

To accommodate vector processors such as GPUs, we augment n copies of an existing transition
with a single merged transition. The new transition will take significantly less time than performing the
n transitions individually. Obviously, a real implementation will not enumerate these merged transitions,
but we can view it this way in the abstract. A similar argument also applies to data transfers, where we
can benefit from doing bulk operations.

This gives the formal semantics for our calculus as defined in Figure 2. We give this for an abstract
machine; however, just as Java trivially compiles to the JVM, our non-nested language can trivially be
compiled to this JCAM. We also use a small-step semantics rather than the ChAM [4] or rewriting [5]
style used previously, as this is more appropriate for our ongoing work on analysis and optimisation.
Each of the workers can be either processing a transition rule, or IDLE. The initial state is for all workers
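Domains:

    $(f, t), (f, \theta) \in \mathrm{SignalValue} = \mathrm{Signal} \times \mathrm{Time}$
    $\Gamma \in \mathrm{Environment} = \mathfrak{m}(\mathrm{SignalValue} \times \mathrm{Value}^*)$    (messages available)
    $t \in \mathrm{Time} = \mathbb{N}_0$
    $\Sigma \in \mathrm{GlobalState} = \mathrm{Worker} \to (\mathrm{LocalState} \cup \{\mathrm{IDLE}\})$
    $(l, \theta, \sigma) \in \mathrm{LocalState} = \mathrm{Label} \times \mathrm{Time} \times \mathrm{Value}^*$    (program counter, context, local stack)
    $v \in \mathrm{Value} = \mathrm{SignalValue} \cup \mathrm{Primitive}$

Rules (judgement form of $\Gamma, t, \Sigma \to \Gamma', t', \Sigma'$):

    $\Gamma + \Delta,\ t,\ \Sigma + \{w \mapsto \mathrm{IDLE}\} \to \Gamma,\ t,\ \Sigma + \{w \mapsto (l_0, \theta, \vec{v}_1 \cdot \ldots \cdot \vec{v}_n)\}$    (fire)
        where $\Delta = \{((f_1, \theta), \vec{v}_1), \ldots, ((f_n, \theta), \vec{v}_n)\}$ and $r^w = f_1(\ldots) \,\&\, \ldots \,\&\, f_n(\ldots)\ \{ l_0, \ldots \}$

    $\Gamma,\ t,\ \Sigma + \{w \mapsto (\mathrm{EMIT}^l, \theta, \vec{v} \cdot s \cdot \sigma)\} \to \Gamma + (s, \vec{v}),\ t,\ \Sigma + \{w \mapsto (\mathrm{next}(l), \theta, \sigma)\}$    (emit)

    $\Gamma,\ t,\ \Sigma + \{w \mapsto (\mathrm{CONSTRUCT}\langle f \rangle^l, \theta, \vec{v} \cdot \sigma)\} \to \Gamma + ((f, t), \vec{v}),\ t + 1,\ \Sigma + \{w \mapsto (\mathrm{next}(l), \theta, \sigma)\}$    (construct)

    $\Gamma,\ t,\ \Sigma + \{w \mapsto (\mathrm{LOAD.SIGNAL}\langle f \rangle^l, \theta, \sigma)\} \to \Gamma,\ t,\ \Sigma + \{w \mapsto (\mathrm{next}(l), \theta, (f, \theta) \cdot \sigma)\}$    (load)

    $\Gamma,\ t,\ \Sigma + \{w \mapsto (\mathrm{FINISH}^l, \theta, \sigma)\} \to \Gamma,\ t,\ \Sigma + \{w \mapsto \mathrm{IDLE}\}$    (finish)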

Figure 2: Non-Nested Join Calculus Abstract Machine (JCAM) Semantics 

to be IDLE, and some messages corresponding to program arguments to be available in $\Gamma$. Unmapped
programs can be considered to have just a single worker.
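The small-step rules can be prototyped in a few lines. The Python sketch below is a loose reading of the abstract machine, not a definitive implementation: the explicit arity operands on EMIT and CONSTRUCT, the use of integer labels with next(l) = l + 1, the stack layout (arguments on top, then the signal value) and the list-based message pool are all assumptions.

```python
IDLE = "IDLE"

def fire(gamma, sigma, worker, rule):
    """(fire): consume one message per pattern signal, all from one instance theta.
    Returns (new gamma, new sigma), or None if the rule cannot fire."""
    pattern, entry = rule
    if sigma[worker] != IDLE:
        return None
    for (_, theta), _args in gamma:          # try each instance that has a message
        rest, stack = list(gamma), []
        try:
            for f in pattern:
                msg = next(m for m in rest if m[0] == (f, theta))
                rest.remove(msg)
                stack.extend(msg[1])         # v_1 . ... . v_n
        except StopIteration:
            continue                         # some signal is missing for this theta
        return rest, {**sigma, worker: (entry, theta, stack)}
    return None

def step(code, gamma, t, sigma, worker):
    """Execute one instruction of a busy worker; returns (gamma, t, sigma)."""
    pc, theta, stack = sigma[worker]
    op, arg = code[pc]
    if op == "EMIT":                         # arg = number of argument values (assumed)
        v, s, rest = tuple(stack[:arg]), stack[arg], stack[arg + 1:]
        return gamma + [(s, v)], t, {**sigma, worker: (pc + 1, theta, rest)}
    if op == "CONSTRUCT":                    # arg = (constructor signal, arity) (assumed)
        f, n = arg
        v, rest = tuple(stack[:n]), stack[n:]
        return gamma + [((f, t), v)], t + 1, {**sigma, worker: (pc + 1, theta, rest)}
    if op == "LOAD_SIGNAL":                  # push the signal f of the current instance
        return gamma, t, {**sigma, worker: (pc + 1, theta, [(arg, theta)] + stack)}
    if op == "FINISH":
        return gamma, t, {**sigma, worker: IDLE}
    raise ValueError(f"unknown instruction {op}")
```

As a usage example, a rule with pattern ("send", "dest") and body [EMIT 1; FINISH] forwards the argument of send to the signal value carried by dest, after which the worker returns to IDLE.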



5 Future Work on Scheduling 

As before, we rely on a scheduler to be able to make non-deterministic choices corresponding to the
fastest execution, and clearly these need to be made quickly. In this section, we briefly discuss our
thoughts on this problem.

It is clear that to optimise the expected execution time, we need transition costs, and also a probability 
distribution for the output signals of a transition. We believe that these could be effectively provided by 
profiling. This is already commonly used in auto-tuning (i.e. transition costs) and branch predictors (i.e. 
signal probabilities). It could also be used for ahead-of-time scheduling or just for determining baselines. 

For practical scheduling, it is most likely that a form of machine learning will be used to adapt to new
architectures. This has been used successfully for streaming applications [9], which are not dissimilar
to a very restricted Join Calculus. Existing implementations of the Join Calculus have not considered
the scheduling problem, and simply pick the first transition found to match. In order to maintain this
simplicity, we would consider whether the output of such a learning algorithm could be a priority list of
transitions that evolves over time to offer some load balancing.
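A priority list keeps the first-found matching strategy almost unchanged. The Python sketch below shows one way this could look; `pick_transition` and `demote` are hypothetical names, and moving a transition to the back of the list is only a crude stand-in for whatever update a learning algorithm would produce.

```python
def pick_transition(priority, enabled):
    """Return the first enabled transition in priority order, or None."""
    for name in priority:
        if name in enabled:
            return name
    return None

def demote(priority, name):
    """Move a transition to the back of the list (placeholder for a learned update)."""
    return [n for n in priority if n != name] + [name]
```

For example, demoting a local split in favour of a transfer would shift work towards the other processor without changing the matching code.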

Prior work has shown it is best to check for transition firings each time a message is emitted, rather
than having a separate firing process [11]. This will result in a queue of transitions (perhaps picked by
a first-found approach). For load balancing, idle workers could then steal from these transition queues.
However, unlike in standard work stealing, there are two possible forms of stealing: as well as taking
a matched transition, individual messages could be taken by first decomposing some of the existing
matches.
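The two forms of stealing can be contrasted in a short Python sketch. The queue layout (a list of (rule, messages) pairs), the function names and the policy of returning the leftover messages to the victim's pool are all assumptions made for illustration.

```python
def steal_match(victim_queue):
    """Form 1: take a whole matched transition from the back of the queue."""
    return victim_queue.pop() if victim_queue else None

def steal_message(victim_queue, victim_pool, signal):
    """Form 2: decompose the most recent match that uses `signal`, take one such
    message, and return the match's other messages to the victim's pool."""
    for i in range(len(victim_queue) - 1, -1, -1):
        _rule, messages = victim_queue[i]
        hit = next((m for m in messages if m[0] == signal), None)
        if hit is not None:
            del victim_queue[i]              # the match no longer exists
            rest = list(messages)
            rest.remove(hit)
            victim_pool.extend(rest)         # freed messages can be matched again
            return hit
    return None
```

Stealing a single message is finer grained, but it undoes matching work that the victim has already performed.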
6 Conclusions

In this paper, we have adapted our existing work on mapping Petri-net programs to heterogeneous
architectures to the Join Calculus. In doing so, we showed how to remove the problematic environments
introduced by nested definitions, and also avoid global matching on resource signals by modifying the 
semantics slightly to incorporate workers. This allows programs to be agnostic of the architecture they 
will run on, with any placement and scheduling choices that depend on the architecture being left in the 
program. We have also listed several challenges for building a scheduler that can optimise over these 
choices, and initial ideas to solve them. Such an implementation is our current research goal. 